27 June 2017

The use of R, RStudio and GitHub

An introduction to reproducible workflows in R

Why?

The relevant, not-so-distant past

For working - R

For writing - LaTeX

For tracking - Git

The present and the future

'modern' scientists?

Roles

  • Communication
    • Twitter
    • Popular articles
    • Public speaking
  • Interdisciplinary work
  • Collaboration
    • Beyond your lab

A traditional approach to the scientific method

  1. Devise a fancy question and call it a hypothesis

  2. Formulate a means of collecting the relevant data

  3. Import data set into statistical software package

  4. Run the procedure to get results

  5. Copy and paste appropriate pieces from the analysis into document editor

  6. Add descriptions

  7. Finish/submit report for comments

    REPEAT steps 2 - 7 indefinitely after receiving comments.

Disadvantages of this process

  • The process of data capture is not open

  • Lots of manual work (error-prone)

  • Tedious (who likes to carefully copy-and-paste?)

  • Likely not recordable (did you write down all the steps you followed to get your analysis?)

  • What if you made an error at the beginning of your analysis? If your data had an error? If your hypothesis was biased?
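Steps 3 to 5 of the traditional approach can instead be captured in a script, so that correcting the data or the analysis only means re-running it. A minimal sketch in base R, assuming a hypothetical data set of plant heights under two treatments (the file, column names and values are made up for illustration):

```r
# Hypothetical raw data, written to a temporary CSV file so the example is
# self-contained; in practice this would be your own data file.
raw_file <- tempfile(fileext = ".csv")
write.csv(
  data.frame(
    treatment = c("control", "control", "fertiliser", "fertiliser"),
    height_cm = c(10.2, 11.0, 13.4, 12.9)
  ),
  raw_file,
  row.names = FALSE
)

# Step 3: import the data
plants <- read.csv(raw_file, stringsAsFactors = FALSE)

# Step 4: run the analysis (mean height per treatment)
result <- aggregate(height_cm ~ treatment, data = plants, FUN = mean)

# Step 5: write the results to a file instead of copying and pasting them
out_file <- tempfile(fileext = ".csv")
write.csv(result, out_file, row.names = FALSE)

print(result)
```

If an error turns up in the raw data, fixing the file and running the script again regenerates every downstream result.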

Why R?

  • R is a free software package for statistical analysis and graphics.

  • It excels in helping you with:
    • data manipulation
    • automation
    • reproducibility
    • improved accuracy
    • error finding
    • customizability
    • beautiful visualizations
    • Any downsides?

R console vs RStudio

The R console is the older, bare command-line interface to R, and it favours the command-line programmer

RStudio is a powerful user interface that helps you get better control of your analysis.

  • Like R, it is also completely free.

  • You can write your entire paper/report (text, code, analysis, graphics, etc.) in a single format called R Markdown.

  • If you need to update any of your code, re-knitting the R Markdown document re-runs the analysis, updates your plots and output, and creates an updated PDF file.

  • No more copy-and-paste!

Tidyverse in R

Tidying is the act of converting “messy” data frames into “tidy” ones
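As an illustration of what tidying means in practice, here is a minimal sketch using `pivot_longer()` from tidyr (available in tidyr 1.0.0 and later; older code used `gather()` for the same job). The data frame and its column names are hypothetical:

```r
library(tidyr)

# Hypothetical "messy" data: one column per year, so the variable "year"
# is hidden in the column names instead of having a column of its own.
messy <- data.frame(
  site  = c("A", "B"),
  y2015 = c(12, 7),
  y2016 = c(15, 9)
)

# Tidy form: one row per observation, one column per variable.
tidy <- pivot_longer(
  messy,
  cols         = c(y2015, y2016),
  names_to     = "year",
  names_prefix = "y",      # strip the leading "y" from the year labels
  values_to    = "count"
)

print(tidy)
```

The tidy version has four rows (two sites by two years) and three columns: `site`, `year` and `count`.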

The tidyverse is a set of packages that work in harmony

The core tidyverse packages are:

  • ggplot2, for data visualisation.
  • dplyr, for data manipulation.
  • tidyr, for data tidying.
  • readr, for data import.
  • purrr, for functional programming.
  • tibble, for tibbles, a modern re-imagining of data frames.

Installing the tidyverse also installs a selection of other tidyverse packages
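To give a feel for how these packages work together, the following sketch builds a tibble and summarises it with a dplyr pipeline. The data set, column names and threshold are invented for illustration:

```r
library(dplyr)
library(tibble)

# Hypothetical survey data stored as a tibble.
surveys <- tibble(
  species = c("finch", "finch", "sparrow", "sparrow", "sparrow"),
  mass_g  = c(14.1, 15.3, 27.8, 30.2, 29.0)
)

# Each verb does one job; the pipe (%>%) passes the result to the next step.
summary_tbl <- surveys %>%
  filter(mass_g > 14.5) %>%                          # drop the lightest bird
  group_by(species) %>%                              # one group per species
  summarise(n = n(), mean_mass_g = mean(mass_g)) %>% # count and average
  arrange(species)

print(summary_tbl)
```

The pipeline reads top to bottom as a description of the analysis, which is a large part of what makes tidyverse code easy to review.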

R Markdown?

  • “Literate programming”

  • Embed R code in a Markdown document

  • Renders textual output along with graphics
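The idea can be sketched with `knitr::knit()`, which runs the R code embedded in a document and returns Markdown with the results inserted. The small document below is hypothetical and is built as a character vector so that the example is self-contained; it uses R's built-in `cars` data set:

```r
library(knitr)

# Build the chunk fence programmatically so this example can itself be
# displayed inside a code block.
fence <- strrep("`", 3)

# A tiny, hypothetical R Markdown document: prose, one inline R expression
# and one code chunk.
rmd <- c(
  "# Stopping distances",
  "",
  "The built-in cars data set has `r nrow(cars)` rows.",
  "",
  paste0(fence, "{r mean-speed}"),
  "mean(cars$speed)",
  fence
)

# knit() evaluates the inline expression and the chunk, then returns
# Markdown text containing the code and its output.
md <- knit(text = rmd, quiet = TRUE)
cat(md)
```

In everyday use the document lives in an `.Rmd` file and is converted to HTML, PDF or Word with `rmarkdown::render()`, which additionally needs pandoc.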

Bookdown with R Markdown

Bookdown is one of the more recent additions to the R-universe.

Some highlights are:

  • Multiple output formats

  • Focus on writing the content, not on typesetting

  • Readers can interact with examples

  • Feedback and contributions as the book is developed

  • Integrates with version control

Thesisdown with R Markdown

Thesisdown is built on Bookdown

It currently produces output in four versions:

  • PDF
  • Word
  • ePub
  • HTML and Gitbook

Thesisdown with R Markdown - Files

Thesisdown with R Markdown - PDF

Thesisdown with R Markdown - YAML

Blogdown with R Markdown

You can now extend your online voice by using the tools from your research workflow to publish a blog!

  • The R package Blogdown allows you to create websites using R Markdown

The website is generated from R Markdown documents, so all your

  • results
  • analysis
  • graphics

can be computed and rendered dynamically from R code to your website!

Blogdown with R Markdown - Yihui Xie

Blogdown with R Markdown - Amber Thomas

Git?

Git and GitHub

Git is a version control system that lets you track changes to files over time

  • Git manages the evolution of a set of files – called a repository

GitHub is a website for storing your Git-versioned files remotely

  • GitHub provides a home for your Git-based projects on the internet

  • If you are a student, you can get the micro account, which includes 5 private repositories for free!
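To show what tracking changes looks like without leaving R, here is a minimal sketch using the git2r package, which wraps Git for R users. The repository location, file name, author details and commit messages are all placeholders:

```r
# Create an empty repository in a temporary directory.
repo_dir <- tempfile("demo-repo-")
dir.create(repo_dir)
repo <- git2r::init(repo_dir)

# Git needs an author identity for commits; these values are placeholders.
git2r::config(repo, user.name = "Example Author",
              user.email = "author@example.org")

# Record a first version of a file.
writeLines("x <- 1", file.path(repo_dir, "analysis.R"))
git2r::add(repo, "analysis.R")
git2r::commit(repo, message = "Add first version of the analysis")

# Change the file and record the new version.
writeLines("x <- 2", file.path(repo_dir, "analysis.R"))
git2r::add(repo, "analysis.R")
git2r::commit(repo, message = "Correct the value of x")

# The repository history now holds both versions.
history <- git2r::commits(repo)
print(history)
```

The same steps correspond to `git init`, `git add` and `git commit` on the command line, or to the Git pane in RStudio; pushing the repository to GitHub is a separate step that is not shown here.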

GitHub

Reproducible Research

“Let us change our traditional attitude to the construction of programs: Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to humans what we want the computer to do.”

Donald Knuth, Literate Programming (1984)

“Reproducible research is the idea that data analyses, and more generally, scientific claims, are published with their data and software code so that others may verify the findings and build upon them.”

Roger Peng, Johns Hopkins

Session info

sessionInfo()
## R version 3.4.0 (2017-04-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.2 LTS
## 
## Matrix products: default
## BLAS: /usr/lib/libblas/libblas.so.3.6.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
## 
## locale:
##  [1] LC_CTYPE=en_ZA.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_ZA.UTF-8        LC_COLLATE=en_ZA.UTF-8    
##  [5] LC_MONETARY=en_ZA.UTF-8    LC_MESSAGES=en_ZA.UTF-8   
##  [7] LC_PAPER=en_ZA.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_ZA.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] compiler_3.4.0  backports_1.1.0 magrittr_1.5    rprojroot_1.2  
##  [5] tools_3.4.0     htmltools_0.3.6 yaml_2.1.14     Rcpp_0.12.11   
##  [9] stringi_1.1.5   rmarkdown_1.6   knitr_1.16      stringr_1.2.0  
## [13] digest_0.6.12   evaluate_0.10

R version 3.3.0 (2016-05-03)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.6 (El Capitan)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.11         bookdown_0.4         digest_0.6.12        withr_1.0.1          rprojroot_1.2        R6_2.2.2             backports_1.0.5
 [8] git2r_0.15.0         magrittr_1.5         evaluate_0.10        thesisdown_0.0.2     httr_1.2.1           stringi_1.1.5        curl_1.2
[15] rmarkdown_1.6.0.9000 devtools_1.12.0      tools_3.3.0          stringr_1.2.0        rsconnect_0.4.3      yaml_2.1.14          memoise_1.0.0
[22] htmltools_0.3.6      knitr_1.16.5
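Session information like the listings above can be saved next to your results so that others know which R and package versions produced them. A small sketch in base R; the output file name is only an example:

```r
# Capture the printed session information as lines of text.
session_lines <- capture.output(print(sessionInfo()))

# Save it alongside the results; the file name here is only an example.
session_file <- file.path(tempdir(), "session-info.txt")
writeLines(session_lines, session_file)

# Show what was recorded.
cat(session_lines, sep = "\n")
```

In an R Markdown document, a final chunk that calls `sessionInfo()` achieves the same thing automatically every time the document is knitted.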

Thanks and references